Introduction to Open Data Science - Course Project

About the project

I’m excited to take this course and hope to learn more about statistics and programming! My GitHub page: https://github.com/ntrmarie/IODS-project


Linear regression analysis

  1. The dataset learning2014 describes students’ approaches to learning. It includes 7 variables and 166 observations.

Let’s look at the variables included in the dataset:

learning2014 <- read.csv("/Users/Maria/R/IODS-project/data/learning2014.csv")
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166   7
  2. Now we create a graphical overview of the data
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# creates the plot matrix for the dataset
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

Description of the plots:

The first row of box plots at the top presents the gender-wise distributions of “age”, “attitude”, “deep”, “stra”, “surf” and “points”. Each box plot shows the lower and upper quartiles and the median. Red denotes females and blue denotes males. The first row also shows the gender distribution: there are roughly twice as many females (F) as males (M).

The upper triangle presents correlation coefficients, overall and by gender. The lower triangle presents scatter plots for all pairs of variables, showing the relationship between each pair. The plots on the diagonal, from the upper left to the lower right, present the distributions of the continuous variables in our scope.

The highest (absolute) correlations with the target variable points are for attitude, stra and surf.

In addition, the age distribution is skewed towards young people (mostly around 20 years old).
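Ranking variables by their absolute correlation with the target is easy to do in code. A minimal sketch on a synthetic data frame (with the real data you would use the numeric columns of learning2014 and the target points):

```r
# Rank explanatory variables by absolute correlation with the target.
# Synthetic data for illustration; the real analysis would use learning2014.
set.seed(1)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
df$y <- 2 * df$x1 - df$x3 + rnorm(50)

# correlation of each candidate variable with the target
cors <- sapply(df[, c("x1", "x2", "x3")], function(v) cor(v, df$y))
sort(abs(cors), decreasing = TRUE)
```

With learning2014 this kind of ranking is what motivates picking attitude, stra and surf for the regression model.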

  3. Now we create a regression model with multiple variables
# creates regression model with three variables of highest (absolute) correlation with the target variable
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
# shows the summary
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Summary of the model:

The p-values for stra (0.117) and surf (0.466) are well above 0.05, so the hypothesis that these coefficients are equal to zero cannot be rejected. This means these explanatory variables do not have a statistically significant relationship with the target variable. Now we fit another model without the “surf” variable:

# creates regression model with two variables 
my_model2 <- lm(points ~ attitude + stra, data = learning2014)
# shows the summary
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
  4. In the second model, attitude is significant at the 0.1% level (p = 6.31e-09), while stra is significant only at the 10% level (p = 0.089).

The new interpretation:

Multiple R-squared of the model: the model explains about 20% of the variance in points, since \(R^2 \approx 0.20\).
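As a reminder of what \(R^2\) measures: it is one minus the ratio of the residual sum of squares to the total sum of squares. A quick self-contained check on a toy fit:

```r
# R^2 = 1 - RSS/TSS; verify against the value reported by summary()
set.seed(2)
x <- rnorm(100)
y <- 3 * x + rnorm(100)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)           # residual sum of squares
tss <- sum((y - mean(y))^2)            # total sum of squares
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)  # TRUE
```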

  5. Produce the following diagnostic plots: Residuals vs Fitted values, Normal Q-Q plot and Residuals vs Leverage.

# sets the plots' location
par(mfrow = c(2,2))
# creates diagnostic plots
plot(my_model, which=c(1,2,5))


Logistic regression

Data description

The data describe student achievement in secondary education at two Portuguese schools. We are interested in two datasets covering performance in Mathematics (math) and Portuguese language (por).

The full description of the data is presented here (https://archive.ics.uci.edu/ml/datasets/Student+Performance).

Having combined the two datasets, we will analyze the following variables:

  • school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
  • sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
  • age - student’s age (numeric: from 15 to 22)
  • address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
  • famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
  • Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
  • Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
  • Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
  • Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  • Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  • reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
  • guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
  • traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
  • studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  • schoolsup - extra educational support (binary: yes or no)
  • famsup - family educational support (binary: yes or no)
  • activities - extra-curricular activities (binary: yes or no)
  • nursery - attended nursery school (binary: yes or no)
  • higher - wants to take higher education (binary: yes or no)
  • internet - Internet access at home (binary: yes or no)
  • romantic - with a romantic relationship (binary: yes or no)
  • famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
  • freetime - free time after school (numeric: from 1 - very low to 5 - very high)
  • goout - going out with friends (numeric: from 1 - very low to 5 - very high)
  • Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
  • Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
  • health - current health status (numeric: from 1 - very bad to 5 - very good)
  • failures - number of past class failures
  • paid - extra paid classes within the course subject (Math or Portuguese)
  • absences - number of school absences
  • G1 - first period grade
  • G2 - second period grade
  • G3 - final grade
# read the data
alc <- read.csv("/Users/Maria/R/IODS-project/data/alc.csv")
dim(alc)
## [1] 370  35
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
str(alc)
## 'data.frame':    370 obs. of  35 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

Hypothesis

The purpose of this part is to study the relationships between high/low alcohol consumption and some of the other variables in the data. Let’s choose 4 variables in the data and present four hypotheses about their relationships with alcohol consumption:

  1. alcohol consumption decreases grades - G1, G2, G3
  2. the more free time students have, the more alcohol they drink - freetime
  3. alcohol consumption increases with the number of school absences - absences
  4. the worse the relationships in the family, the higher the alcohol consumption - famrel

Visualization and explanation

Let’s create plots to explore our hypotheses:

# compute the mean grade over the three periods
alc$G <- (alc$G1 + alc$G2 + alc$G3)/3
# check the distributions with bar plots (select, gather and %>% come from dplyr/tidyr)
library(dplyr)
library(tidyr)
bar <- select(alc, "G", "freetime", "absences", "famrel")
gather(bar) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

The plots show the following:

  • the number of absences has a single peak in the 0-10 range, so most students have zero or only a few absences
  • most students rate the quality of their family relations as good
  • most students rate their amount of free time as average or high
  • the most frequent mean grades fall in the 10-16 range
# check the distribution with box plots
# for high_use variable TRUE is for students for which 'alc_use' is greater than 2, otherwise FALSE
# G is a mean grade
d1 <- ggplot(alc, aes(x = high_use, y = G)) + geom_boxplot() + ylab("grade")
d2 <- ggplot(alc, aes(x = high_use, y = freetime)) + geom_boxplot() 
d3 <- ggplot(alc, aes(x = high_use, y = absences)) + geom_boxplot() 
d4 <- ggplot(alc, aes(x = high_use, y = famrel)) + geom_boxplot() + ylab("family relations")

# plot_grid comes from the cowplot package
library(cowplot)
d <- plot_grid(d1, d2, d3, d4, labels = c('a', 'b', 'c', 'd'), align = "hv")
d

The box plots show the following:

  1. in the mean grade vs alcohol consumption plot, the median grade for students with low consumption (alc_use not greater than 2) is higher, and more students in this group reach high mean grades, which supports our hypothesis: alcohol consumption decreases grades
  2. the free time vs alcohol consumption plot shows roughly equal distributions for the two groups, so we cannot confirm our hypothesis
  3. in the absences vs alcohol consumption plot the medians are similar for both groups and the number of absences is low overall; however, the high-consumption group has more high-absence observations, which supports our hypothesis: alcohol consumption increases with the number of school absences
  4. in the family relations vs alcohol consumption plot the medians are the same and students mostly rate their family relations as good, but the high-consumption group rates them somewhat worse, so our hypothesis can tentatively be confirmed: the worse the relationships in the family, the higher the alcohol consumption

Logistic regression

Let’s create a logistic regression model with the binary high/low alcohol consumption variable as the target and these explanatory variables: “freetime”, “G” (mean grade), “absences” and “famrel”.

model <- glm(high_use ~ freetime + G + absences + famrel, data = alc, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ freetime + G + absences + famrel, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8494  -0.8183  -0.6309   1.1567   2.0665  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.29109    0.80769  -0.360 0.718552    
## freetime     0.44845    0.12807   3.502 0.000463 ***
## G           -0.09783    0.04359  -2.244 0.024810 *  
## absences     0.08204    0.02249   3.648 0.000264 ***
## famrel      -0.34003    0.13016  -2.612 0.008989 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 411.59  on 365  degrees of freedom
## AIC: 421.59
## 
## Number of Fisher Scoring iterations: 4
coef(model)
## (Intercept)    freetime           G    absences      famrel 
## -0.29108828  0.44845029 -0.09783016  0.08203815 -0.34002935
odds <- coef(model) %>% exp
confi <- confint(model) %>% exp
## Waiting for profiling to be done...
cbind(odds, confi)
##                  odds     2.5 %    97.5 %
## (Intercept) 0.7474497 0.1514430 3.6313994
## freetime    1.5658836 1.2235216 2.0238304
## G           0.9068029 0.8315906 0.9870153
## absences    1.0854972 1.0407617 1.1372571
## famrel      0.7117494 0.5501574 0.9182071

The model shows the following:

  • holding G, absences and famrel fixed, a one-unit increase in freetime is associated with a 57% increase in the odds of high alcohol consumption, since exp(0.44845) = 1.566
  • holding freetime, absences and famrel fixed, a one-unit increase in grades is associated with a 9% decrease in the odds of high alcohol consumption, since exp(-0.09783) = 0.907
  • holding freetime, G and famrel fixed, a one-unit increase in absences is associated with an 8.5% increase in the odds of high alcohol consumption, since exp(0.08204) = 1.085
  • holding freetime, G and absences fixed, a one-unit increase in famrel is associated with a 29% decrease in the odds of high alcohol consumption, since exp(-0.34003) = 0.711

From the p-values we can conclude that all four explanatory variables are statistically significant at the 5% level.
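The odds ratios can be translated into percent changes in the odds directly from the reported coefficients: exp(beta) is the odds ratio, so 100 * (exp(beta) - 1) is the percent change per one-unit increase.

```r
# coefficient estimates copied from the model summary above
beta <- c(freetime = 0.44845, G = -0.09783, absences = 0.08204, famrel = -0.34003)

# percent change in the odds of high alcohol use per one-unit increase
round(100 * (exp(beta) - 1), 1)
# freetime 56.6, G -9.3, absences 8.5, famrel -28.8
```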

Predictive power of the model

# refit the model with the same four variables (all were significant)
newmodel <- glm(high_use ~ freetime + G + absences + famrel, data = alc, family = "binomial")
summary(newmodel)
## 
## Call:
## glm(formula = high_use ~ freetime + G + absences + famrel, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8494  -0.8183  -0.6309   1.1567   2.0665  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.29109    0.80769  -0.360 0.718552    
## freetime     0.44845    0.12807   3.502 0.000463 ***
## G           -0.09783    0.04359  -2.244 0.024810 *  
## absences     0.08204    0.02249   3.648 0.000264 ***
## famrel      -0.34003    0.13016  -2.612 0.008989 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 411.59  on 365  degrees of freedom
## AIC: 421.59
## 
## Number of Fisher Scoring iterations: 4

Cross tabulation of predictions versus the actual values:

probabilities <- predict(newmodel, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
select(alc, freetime, G, absences, famrel, high_use, probability, prediction) %>% tail(10)
##     freetime         G absences famrel high_use probability prediction
## 361        4 12.666667        3      4    FALSE   0.2993338      FALSE
## 362        3  4.000000        0      4    FALSE   0.3324388      FALSE
## 363        4  4.666667        7      5     TRUE   0.4800836      FALSE
## 364        3 12.333333        1      5    FALSE   0.1454904      FALSE
## 365        4  7.000000        6      4    FALSE   0.4875059      FALSE
## 366        4  7.000000        2      5    FALSE   0.3277964      FALSE
## 367        3 11.666667        2      4    FALSE   0.2170178      FALSE
## 368        1  6.666667        3      1    FALSE   0.3569208      FALSE
## 369        4 12.666667        4      2     TRUE   0.4779205      FALSE
## 370        4 10.666667        2      4     TRUE   0.3236934      FALSE
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   244   15
##    TRUE     90   21

As can be seen from the table, the model correctly classified 244 + 21 = 265 observations and misclassified 90 + 15 = 105 observations, so the training error is 105/370 ≈ 0.28.

# define a loss function: the average prediction error
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2837838

We can see that the loss function gives the same training error.

# the error rate of always guessing FALSE equals the share of high users
mean(alc$high_use)
## [1] 0.3

So, with simple guessing (always predicting low use) about 30% of observations would be misclassified, while the model misclassifies 28%. The model is only slightly more precise than guessing.
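The 30% baseline corresponds to always guessing FALSE (low use). A self-contained check using the same loss function, repeated here, on a toy label vector with 30% high users:

```r
# average prediction error (same definition as above)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# 30 high users out of 100; constantly predicting FALSE (prob = 0)
labels <- c(rep(TRUE, 30), rep(FALSE, 70))
loss_func(class = labels, prob = 0)  # 0.3
```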


Clustering and classification

Data description

The dataset presents the Housing Values in Suburbs of Boston. It includes 506 rows and 14 columns and can be loaded from the MASS package.

The following variables are included:

  • crim - per capita crime rate by town
  • zn - proportion of residential land zoned for lots over 25,000 sq.ft
  • indus - proportion of non-retail business acres per town
  • chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • nox - nitrogen oxides concentration (parts per 10 million)
  • rm - average number of rooms per dwelling
  • age - proportion of owner-occupied units built prior to 1940
  • dis - weighted mean of distances to five Boston employment centres
  • rad - index of accessibility to radial highways
  • tax - full-value property-tax rate per $10,000
  • ptratio - pupil-teacher ratio by town
  • black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • lstat - lower status of the population (percent)
  • medv - median value of owner-occupied homes in $1000s
# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston) # 506*14
## [1] 506  14

Graphical overview and summaries

Let’s explore the data and check the distribution

library(ggplot2)
library(GGally)
library(tidyr)
gather(Boston) %>% ggplot(aes(x = value)) + geom_histogram(aes(y = ..density..), colour = "black", fill = "white", bins = 18) + facet_wrap("key", scales = "free")

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

From the plots we can conclude that:

  • the variables black, chas, crim, ptratio and zn are highly skewed in different directions: black is concentrated at high values (a high proportion in most towns), chas is concentrated at 0 (most tracts do not bound the river), crim is concentrated near zero with a long right tail (crime rates are low in most towns), ptratio is concentrated at high values (the pupil-teacher ratio is rather high), and zn is concentrated at 0 (the proportion of residential land zoned for lots over 25,000 sq.ft is low in most towns).

Let’s look at the correlations between variables

library(MASS)
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

From the plot we can see the following:

  1. Strong positive correlations:
  • full-value property-tax rate per $10,000 tax and index of accessibility to radial highways rad: 0.91
  • nitrogen oxides concentration nox and proportion of non-retail business acres per town indus: 0.76
  • nitrogen oxides concentration nox and proportion of owner-occupied units built prior to 1940 age: 0.73
  2. Strong negative correlations:
  • weighted mean of distances to five Boston employment centres dis and nitrogen oxides concentration nox: -0.77
  • weighted mean of distances to five Boston employment centres dis and proportion of owner-occupied units built prior to 1940 age: -0.75
  • median value of owner-occupied homes medv and lower status of the population lstat: -0.74

Data wrangling

Let’s scale the data

# center and standardize variables
boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
class(boston_scaled)
## [1] "matrix" "array"
boston_scaled <- as.data.frame(boston_scaled)

We can see the following change after scaling: every variable now has mean zero and unit standard deviation, so the variables are on a comparable scale.
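scale() standardizes each column by subtracting its mean and dividing by its standard deviation, which is why every mean above is exactly zero. A one-line check:

```r
# scale() computes (x - mean(x)) / sd(x) column by column
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)  # TRUE
```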

Let’s create a categorical variable of the crime rate

# create a quantile vector of crime and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime_labels <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim, breaks = bins, labels = crime_labels, include.lowest = TRUE)
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crime variable from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Let’s divide the data into the train and test sets

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set 
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)

Linear discriminant analysis (LDA)

For linear discriminant analysis we use the categorical crime rate as the target variable and all the other variables as predictor variables.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2475248 0.2400990 0.2574257 0.2549505 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       1.05140586 -0.8864965 -0.11484506 -0.9092956  0.45826609 -0.9410647
## med_low  -0.07946536 -0.2606603 -0.02879709 -0.5439900 -0.12218729 -0.3636284
## med_high -0.38664383  0.1841756  0.14409500  0.3729657  0.09991523  0.4501857
## high     -0.48724019  1.0170891 -0.04298342  1.0600741 -0.43299060  0.8038695
##                 dis        rad        tax    ptratio       black       lstat
## low       0.9129302 -0.6867152 -0.7429649 -0.5078805  0.38213068 -0.77966885
## med_low   0.3319333 -0.5402442 -0.4759577 -0.0237465  0.31414792 -0.18861029
## med_high -0.3749797 -0.4617482 -0.3353327 -0.2357294  0.06900514  0.00633043
## high     -0.8454542  1.6384176  1.5142626  0.7811136 -0.71012249  0.86259576
##                  medv
## low       0.578681466
## med_low   0.009547813
## med_high  0.125445154
## high     -0.665974776
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.08793077  0.62540182 -0.95112675
## indus    0.07283525 -0.01849409  0.24055952
## chas    -0.10535278 -0.05185597  0.11517851
## nox      0.37481405 -0.86782590 -1.28135552
## rm      -0.15052983 -0.11921800 -0.20250487
## age      0.09035050 -0.43872445 -0.19837762
## dis     -0.07343855 -0.19003496  0.07823450
## rad      3.68208067  1.11009547 -0.08814807
## tax      0.05485828 -0.12890246  0.52715519
## ptratio  0.10918027 -0.12256917 -0.20642771
## black   -0.09152740  0.05439851  0.13469450
## lstat    0.22942371 -0.13363770  0.21679157
## medv     0.22285729 -0.30348726 -0.25274731
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9560 0.0353 0.0088
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

Let’s draw the LDA plot

plot(lda.fit, dimen = 2, col = classes, pch = classes )
lda.arrows(lda.fit, myscale = 2)

Prediction for LDA

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13      11        3    0
##   med_low    4      19        6    0
##   med_high   1       4       14    3
##   high       0       0        0   24

In general the model predicts the low and high crime classes well, but performs noticeably worse for the two middle classes.
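As a quick sanity check, the overall accuracy can be computed from the diagonal of the cross-tabulation above (a minimal sketch, with the printed values copied into a matrix):

```r
# confusion matrix values copied from the cross-tabulation above
conf <- matrix(c(13, 11,  3,  0,
                  4, 19,  6,  0,
                  1,  4, 14,  3,
                  0,  0,  0, 24),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# correct predictions sit on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)  # 0.686, i.e. about 69% of the test observations are classified correctly
```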

K-means

To perform the k-means algorithm on our dataset, let's scale the data again

boston_scaled2 <- scale(Boston)
boston_scaled2 <- as.data.frame(boston_scaled2)
summary(boston_scaled2)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Now let’s calculate the distances between the observations using the Euclidean distance

# euclidean distance matrix
dist_eu2 <- dist(boston_scaled2) # function uses Euclidean by default
summary(dist_eu2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
str(dist_eu2)
##  'dist' num [1:127765] 1.94 2.28 2.57 2.69 2.24 ...
##  - attr(*, "Size")= int 506
##  - attr(*, "Labels")= chr [1:506] "1" "2" "3" "4" ...
##  - attr(*, "Diag")= logi FALSE
##  - attr(*, "Upper")= logi FALSE
##  - attr(*, "method")= chr "euclidean"
##  - attr(*, "call")= language dist(x = boston_scaled2)
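As a minimal illustration of what `dist()` computes (a toy sketch, not part of the Boston analysis), here is the Euclidean distance between two points forming a 3-4-5 right triangle:

```r
# two points: (0, 0) and (3, 4)
m <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)

d <- dist(m)   # method = "euclidean" is the default
as.numeric(d)  # 5, since sqrt(3^2 + 4^2) = 5
```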

Now we run the k-means algorithm on the dataset. First we need to choose the number of clusters, which we do by looking at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes.

A good choice is the point where the total WCSS drops sharply; in the plot below that happens at two clusters, so two seems optimal.

library(GGally)
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <- kmeans(boston_scaled2, centers = 2)
# plot the result
ggpairs(boston_scaled2[0:7], aes(color = as.factor(km$cluster)))
## Warning in cor(x, y): the standard deviation is zero

(The warning above, translated here from a Russian locale, is printed six times: it occurs when a variable is constant within one of the cluster colour groups, so its correlation is undefined.)

ggpairs(boston_scaled2[8:14], aes(color = as.factor(km$cluster)))

We can see that two clusters are enough to show the difference between the groups. The most evident differences are in indus, nox, tax, rad, dis and ptratio.
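The same conclusion can be checked numerically from the cluster centers: on the standardized scale, variables with a large gap between the two centers are the ones that separate the clusters most clearly. A sketch (assuming the MASS package for the Boston data):

```r
library(MASS)  # provides the Boston dataset

boston_scaled2 <- as.data.frame(scale(Boston))
set.seed(123)
km <- kmeans(boston_scaled2, centers = 2)

# absolute difference between the two cluster centers, per variable;
# the largest gaps mark the variables that separate the clusters
gaps <- abs(km$centers[1, ] - km$centers[2, ])
round(sort(gaps, decreasing = TRUE), 2)
```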


Dimensionality reduction techniques

Description

The dataset includes now the following variables:

  • gni = Gross National Income per Capita
  • leb = Life Expectancy at Birth
  • eye = Expected Years of Education
  • mmr = Maternal Mortality Ratio
  • abr = Adolescent Birth Rate
  • prp = Female Percent Representation in Parliament
  • edu.rat = the ratio of Female and Male populations with secondary education (secedu.f/secedu.m)
  • lab.rat = the ratio of labour force participation of females and males (lfpr.f/lfpr.m)

Exploratory analysis

Let’s explore the dataset

# check the distribution
library(corrplot)
ggpairs(human)

cor(human) %>% round(2)
##           gni   leb   eye   mmr   abr   prp edu.rat lab.rat
## gni      1.00  0.63  0.62 -0.50 -0.56  0.09    0.43   -0.02
## leb      0.63  1.00  0.79 -0.86 -0.73  0.17    0.58   -0.14
## eye      0.62  0.79  1.00 -0.74 -0.70  0.21    0.59    0.05
## mmr     -0.50 -0.86 -0.74  1.00  0.76 -0.09   -0.66    0.24
## abr     -0.56 -0.73 -0.70  0.76  1.00 -0.07   -0.53    0.12
## prp      0.09  0.17  0.21 -0.09 -0.07  1.00    0.08    0.25
## edu.rat  0.43  0.58  0.59 -0.66 -0.53  0.08    1.00    0.01
## lab.rat -0.02 -0.14  0.05  0.24  0.12  0.25    0.01    1.00
cor(human) %>% corrplot(method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

summary(human)
##       gni              leb             eye             mmr        
##  Min.   :   581   Min.   :49.00   Min.   : 5.40   Min.   :   1.0  
##  1st Qu.:  4198   1st Qu.:66.30   1st Qu.:11.25   1st Qu.:  11.5  
##  Median : 12040   Median :74.20   Median :13.50   Median :  49.0  
##  Mean   : 17628   Mean   :71.65   Mean   :13.18   Mean   : 149.1  
##  3rd Qu.: 24512   3rd Qu.:77.25   3rd Qu.:15.20   3rd Qu.: 190.0  
##  Max.   :123124   Max.   :83.50   Max.   :20.20   Max.   :1100.0  
##       abr              prp           edu.rat          lab.rat      
##  Min.   :  0.60   Min.   : 0.00   Min.   :0.1717   Min.   :0.1857  
##  1st Qu.: 12.65   1st Qu.:12.40   1st Qu.:0.7264   1st Qu.:0.5984  
##  Median : 33.60   Median :19.30   Median :0.9375   Median :0.7535  
##  Mean   : 47.16   Mean   :20.91   Mean   :0.8529   Mean   :0.7074  
##  3rd Qu.: 71.95   3rd Qu.:27.95   3rd Qu.:0.9968   3rd Qu.:0.8535  
##  Max.   :204.80   Max.   :57.50   Max.   :1.4967   Max.   :1.0380

From the plot and the correlation matrix we can see the following:

  1. strong positive correlations:
  • expected years of education eye and life expectancy at birth leb: 0.79
  • adolescent birth rate abr and maternal mortality ratio mmr: 0.76
  2. strong negative correlations:
  • maternal mortality ratio mmr and life expectancy at birth leb: -0.86
  • adolescent birth rate abr and life expectancy at birth leb: -0.73

Only the expected years of education eye variable is close to normally distributed; the other variables are skewed.

Principal component analysis (PCA)

Let’s perform principal component analysis for our dataset

# performing principal component analysis (with the SVD method)
pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                   PC1           PC2           PC3           PC4           PC5
## gni     -9.999832e-01 -0.0057723054  5.156742e-04 -4.932889e-05 -0.0001135863
## leb     -2.815823e-04  0.0283150248 -1.294971e-02  6.752684e-02  0.9865644425
## eye     -9.562910e-05  0.0075529759 -1.427664e-02  3.313505e-02  0.1431180282
## mmr      5.655734e-03 -0.9916320120 -1.260302e-01  6.100534e-03  0.0266373214
## abr      1.233961e-03 -0.1255502723  9.918113e-01 -5.301595e-03  0.0188618600
## prp     -5.526460e-05  0.0032317269  7.398331e-03  9.971232e-01 -0.0716401914
## edu.rat -5.607472e-06  0.0006713951  3.412027e-05  2.736326e-04 -0.0022935252
## lab.rat  2.331945e-07 -0.0002819357 -5.302884e-04  4.692578e-03  0.0022190154
##                   PC6           PC7           PC8
## gni      2.711698e-05  8.075191e-07 -1.176762e-06
## leb      1.453515e-01 -5.380452e-03  2.281723e-03
## eye     -9.882477e-01  3.826887e-02  7.776451e-03
## mmr     -1.695203e-03 -1.355518e-04  8.371934e-04
## abr     -1.273198e-02  8.641234e-05 -1.707885e-04
## prp      2.309896e-02  2.642548e-03  2.680113e-03
## edu.rat -2.180183e-02 -6.998623e-01  7.139410e-01
## lab.rat -3.264423e-02 -7.132267e-01 -7.001533e-01
# drawing a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.6, 0.6), col = c("brown", "blue"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Standardization

The result is hard to interpret: without scaling, PCA is dominated by gni, whose variance is by far the largest, so the first principal component is essentially gni alone. We need to scale the data.
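The problem can be shown with a toy example (hypothetical data, not the human dataset): when one variable has a much larger variance than the others, unscaled PCA loads almost entirely on it.

```r
set.seed(1)
x <- rnorm(100, sd = 1000)  # large-scale variable (like gni)
y <- rnorm(100, sd = 1)     # small-scale variable

pca_raw    <- prcomp(cbind(x, y))                 # unscaled
pca_scaled <- prcomp(cbind(x, y), scale. = TRUE)  # standardized

# proportion of variance explained by PC1 in each case
summary(pca_raw)$importance["Proportion of Variance", "PC1"]    # practically 1: PC1 is just x
summary(pca_scaled)$importance["Proportion of Variance", "PC1"] # close to 0.5
```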

# scaling the variables
human_scaled <- scale(human)
summary(human_scaled)
##       gni               leb               eye               mmr         
##  Min.   :-0.9193   Min.   :-2.7188   Min.   :-2.7378   Min.   :-0.6992  
##  1st Qu.:-0.7243   1st Qu.:-0.6425   1st Qu.:-0.6782   1st Qu.:-0.6496  
##  Median :-0.3013   Median : 0.3056   Median : 0.1140   Median :-0.4726  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.6717   3rd Qu.: 0.7126   3rd Qu.: 0.1932  
##  Max.   : 5.6890   Max.   : 1.4218   Max.   : 2.4730   Max.   : 4.4899  
##       abr               prp             edu.rat           lab.rat       
##  Min.   :-1.1325   Min.   :-1.8203   Min.   :-2.8189   Min.   :-2.6247  
##  1st Qu.:-0.8394   1st Qu.:-0.7409   1st Qu.:-0.5233   1st Qu.:-0.5484  
##  Median :-0.3298   Median :-0.1403   Median : 0.3503   Median : 0.2316  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6030   3rd Qu.: 0.6127   3rd Qu.: 0.5958   3rd Qu.: 0.7350  
##  Max.   : 3.8344   Max.   : 3.1850   Max.   : 2.6646   Max.   : 1.6632

We can see that all the means now equal zero (and each variable has unit standard deviation).

Then we perform PCA again but on standardized data

# perform PCA (SVD method)
human_pca <- prcomp(human_scaled)
human_pca
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                 PC1         PC2         PC3         PC4        PC5         PC6
## gni     -0.35048295 -0.05060876  0.20168779 -0.72727675  0.4950306 -0.11120305
## leb     -0.44372240  0.02530473 -0.10991305 -0.05834819 -0.1628935  0.42242796
## eye     -0.42766720 -0.13940571  0.07340270 -0.07020294 -0.1659678  0.38606919
## mmr      0.43697098 -0.14508727  0.12522539 -0.25170614  0.1800657 -0.17370039
## abr      0.41126010 -0.07708468 -0.01968243  0.04986763  0.4672068  0.76056557
## prp     -0.08438558 -0.65136866 -0.72506309  0.01396293  0.1523699 -0.13749772
## edu.rat -0.35664370 -0.03796058  0.24223089  0.62678110  0.5983585 -0.17713316
## lab.rat  0.05457785 -0.72432726  0.58428770  0.06199424 -0.2625067  0.03500707
##                 PC7         PC8
## gni     -0.13711838 -0.16961173
## leb     -0.43406432  0.62737008
## eye      0.77962966 -0.05415984
## mmr      0.35380306  0.72193946
## abr     -0.06897064 -0.14335186
## prp      0.00568387 -0.02306476
## edu.rat  0.05773644  0.16459453
## lab.rat -0.22729927 -0.07304568
# draw the biplot from the PCA result and the original variables
biplot(human_pca, choices = 1:2, cex = c(0.4, 0.6), col = c("brown", "blue"))

Now the plot looks much better in terms of explanation.

We can conclude the following points:

  • maternal mortality ratio mmr, adolescent birth rate abr, life expectancy at birth leb, expected years of education eye, gross national income gni and the education ratio edu.rat load mainly on the first principal component (PC1)
  • female percent representation in parliament prp and the ratio of labour force participation of females and males lab.rat load mainly on the second principal component (PC2)
  • mmr and abr point in the same direction, so they are strongly positively correlated with each other; they point opposite to leb, eye, gni and edu.rat, with which they are negatively correlated
  • leb, edu.rat, gni and eye point in the same direction, so these four variables are strongly positively correlated with each other
  • prp and lab.rat point in the same direction along PC2, indicating positive correlation between them

Personal interpretations of the first two principal component dimensions:

  • the first principal component orders countries by overall level of development: countries such as Sierra Leone, Burkina Faso and Chad score high on it because of their high adolescent birth rate abr and maternal mortality ratio mmr, while more developed countries score low
  • the second principal component separates countries by female participation in society, i.e. parliament representation prp and the labour force participation ratio lab.rat

Multiple Correspondence Analysis (MCA)

We load the tea dataset from the FactoMineR package

library(FactoMineR)
data(tea)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# new dataset with selected columns
tea_time <- dplyr::select(tea, one_of(keep_columns))
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
dim(tea_time)
## [1] 300   6

Then we visualize the dataset:

library(ggplot2)
library(tidyr)
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

Now we perform the MCA:

mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

We can see that how and where have the strongest correlation with the first dimension, and the same two variables are also the most strongly correlated with the second dimension.

Let’s visualize MCA by categories and individuals:

# variable categories (individuals hidden)
plot(mca, invisible=c("ind"), habillage = "quali")

# individuals (variable categories hidden)
plot(mca, invisible=c("var"), habillage = "quali")

We can see that the variables how and where behave most similarly. We can conclude that

  • people who buy tea in chain stores drink tea bags
  • people who buy tea both in chain stores and in tea shops drink both tea bags and unpackaged tea
  • people who buy tea in tea shops drink unpackaged tea
  • green tea tends to be drunk plain, while black and Earl Grey tea are often taken with milk or lemon

The first dimension explains 15.2% of the variance and the second dimension 14.2%.
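These percentages can also be read directly from the MCA object: a sketch, assuming FactoMineR is installed (mca$eig holds the eigenvalue table printed in the summary above).

```r
library(FactoMineR)

data(tea)
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
mca <- MCA(tea_time, graph = FALSE)

# percentage of variance for the first two dimensions
round(mca$eig[1:2, "percentage of variance"], 1)  # 15.2 and 14.2, as in the summary
```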